The browserly module is designed to easly create track view visualizations of positional genomic data such as ChIP-seq, ATAC-seq, or DNase hypersesitivey assays. High-level functions are provided to easly make visualizations that the user can use to compare results between samples, conditions, and even genomic loci.
A trackview requires three components to display properly.
Browserly needs to know the range to plot. This is used to pull the correct annotation information. The function get_view_range makes the appropriate object for use with browserly. If passed an OrganismDbi object, get_view_range can also be used to get the range by gene symbol. Here we’ll use the region for the gene GREB1, because it is a good responder to ER and PR.
view_range <- get_view_range(chr = 2, start=11482341, end=11642788)
view_range
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 2 [11482341, 11642788] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The range can also be extended if desired. This is a particularly useful function when creating uniformly sized views around disparate regions. For example, the TSS of many genes can be made as 1bp GRanges and extended to be of equal length.
The range can be extended symmetrically on both sides
extend_grange(view_range, 50000)
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 2 [11432341, 11692788] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
or asymmetrically by passing a vector
extend_grange(view_range, c(1000, 5000))
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 2 [11481341, 11647788] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We pre-load the annotation information from the TxDB into an easily searched list. This saves time when repeated calls are made to find the genomic ranges and features that are used in plotting coverage information. The variable tx_data contains several lists that a indexed by transcript feature (e.g. intron, utr5). It also keeps track of the seqlevelstyle used in the database.
txdb <- TxDb.Hsapiens.BioMart.igis
tx_data <- load_tx_data(txdb)
print(names(tx_data))
## [1] "intron" "utr5" "utr3" "cds"
## [5] "exon" "transcripts" "seqlevelsStyle"
Note: We also use the annotation helper function match_tx_to_gene to zip together OrgDB and TxDB objects by entrez id. This allows genes to be searched by gene symbol, which is often more human friendly.
symbol_table <- match_tx_to_gene(txdb, 'org.Hs.eg')
print(head(symbol_table))
## # A tibble: 6 × 6
## entrez chr start end strand symbol
## <chr> <fctr> <int> <int> <fctr> <chr>
## 1 1 19 58346806 58353499 - A1BG
## 2 10 8 18391245 18401213 + NAT2
## 3 100 20 44619519 44651735 - ADA
## 4 1000 18 27950966 28177481 - CDH2
## 5 10000 1 243488233 243851079 - AKT3
## 6 100008586 X 49570366 49577757 + GAGE12F
Users are expected to provide their own data for plotting. Data should be provided as a GRanges object, where the metadata columns (mcols) contain the coverage values in each range.
The utility function make_coverage_tracks is provided for convenience, and has a few helpful features. It takes a list of coverage files (e.g. bigwig, bam) and the view_range to be displayed. Optionally, the user can provide corresponding names for the samples (e.g. treatment condition, antibody), and a scaling factor for the coverage data (e.g. library size).
# Get the coverage
cvg <- make_coverage_tracks(fi[idx, "File.bw"],
target_range = view_range,
sample_names = shortened_names,
scaling_factors = sf)
head(cvg)
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | ER_input
## <Rle> <IRanges> <Rle> | <numeric>
## [1] 2 [11482341, 11482439] * | 0.0141685398538994
## [2] 2 [11482440, 11482539] * | 0.00255033717370189
## [3] 2 [11482540, 11482639] * | 0
## [4] 2 [11482640, 11482739] * | 0
## [5] 2 [11482740, 11482839] * | 0
## [6] 2 [11482840, 11482939] * | 0
## ER_ab ER_ab_Pr_treat PR_ab
## <numeric> <numeric> <numeric>
## [1] 0.136749081577534 0.117184084720934 0.0917524797025164
## [2] 0.0107314675603833 0.120886707902033 0.0917524797025164
## [3] 0.0825497504644868 0.261271271917296 0.0458762398512582
## [4] 0.161797510910394 0.389957122264621 0.170659612246681
## [5] 0.141985570798917 0.366559694928744 0.183504959405033
## [6] 0.0734692779133933 0.190104097104003 0.0954225788906171
## PR_ab_Pr_treat
## <numeric>
## [1] 0
## [2] 0.0515054944045998
## [3] 0.174670807111252
## [4] 0.223936932193912
## [5] 0.186987338381917
## [6] 0.063822025675265
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
A common representation of coverage is a trackview. Browserly makes it simple to create an interactive trackview that looks similar to other packages, such as Gviz.
plot_single_locus(view_range, tx_data, cvg, stacking='squish', type="scatter")
By default, browserly densely stacks all overlapping transcripts to save visual space. Future versions may allow for the display of “canonical” transcripts. All y axes default to being on the same scale, but this can also be disabled.
plot_single_locus(view_range, tx_data, cvg, type = "scatter", sync_y = FALSE)
Browserly automatcially determines the type of plot for showing coverage. When many samples are displayed, browserly defaults to using a heatmap. This saves vertical space, without losing information.
plot_single_locus(view_range, tx_data, cvg)
SNPs can be added from external sources, provided as GRanges. Here we use SNP information from the GWAS catalog.
snps <- readRDS("/gne/research/workspace/finklej/ExternalData/SNP/gwas_catalog_v1.0.1-associations_e84_r2016-07-10.rds")
snps_in_range <- get_snps_in_range(snp_gr = snps, target_range = view_range)
plot_single_locus(view_range, tx_data, cvg, type='scatter', snps = snps_in_range,
stacking = 'squish', sync_y = FALSE)
All plotly arguments that can be passed to a trace can also be passed through plot_single_locus. The outcome of these arguments is not guaranteed. Please see the plotly documentation for more information.
Browserly also provides functionality for displaying multiple genomic loci. This may be useful when the user wants to compare coverage for several genes under many different conditions. For example, prior knowledge may indicate that a set of genes shows signifcant differential expression, which may be attributable to changes in transcription factor binding between treatments.
The user specifies the genes of interest, as well as the coverage files for the samples that will be displayed.
genes
## [1] "RNF2" "E2F1" "GREB1" "BAP1"
cvg_files
## [1] "/gne/research/workspace/rinaldj1/projects/bulkChipPipeline/data/CARROLL_PR/ngs_pipeline/MCF7_Input_Full_Media_3hr/results/MCF7_Input_Full_Media_3hr.coverage.bw"
## [2] "/gne/research/workspace/rinaldj1/projects/bulkChipPipeline/data/CARROLL_PR/ngs_pipeline/SRR2000673/results/SRR2000673.coverage.bw"
## [3] "/gne/research/workspace/rinaldj1/projects/bulkChipPipeline/data/CARROLL_PR/ngs_pipeline/SRR2000679/results/SRR2000679.coverage.bw"
## [4] "/gne/research/workspace/rinaldj1/projects/bulkChipPipeline/data/CARROLL_PR/ngs_pipeline/SRR2000712/results/SRR2000712.coverage.bw"
## [5] "/gne/research/workspace/rinaldj1/projects/bulkChipPipeline/data/CARROLL_PR/ngs_pipeline/SRR2000718/results/SRR2000718.coverage.bw"
The browserly plotting function requires two additional arguments:
Here we prepare the gene regions to be compared by centering around each gene’s TSS. This provides a uniform comparison between genes.
centered_genes <- lapply(genes, function(gene){
gene_data <- get_centered_gene_info(gene = gene,
extension = extension,
cvg_files = cvg_files,
sample_names = conditions,
scaling_factor = sf,
tx_data = tx_data,
symbol_table = symbol_table)
return(gene_data)
})
cvg_list <- GRangesList(lapply(centered_genes, function(x) {x$gene_cvg}))
names(cvg_list) <- genes
tx_list <- GRangesList(lapply(centered_genes, function(x) {x$gene_tx}))
names(tx_list) <- genes
Next we plot the gene coverage for each gene under the specified conditions. By default, the subplots are not on the same scaled. This can be changed using the same sync_y argument as in the single locus view.
plot_multiple_genes(genes = genes,
centered_cvg = cvg_list,
centered_tx = tx_list)
Gene expression can be added to the plots if a named matrix is passed. The rows correspond to the genes to be plotted, and the columns are labelled by treatment condition. Columns with the same names are grouped together in box plots.
print(gene_exprs)
## E2 E2_R5020 E2_Progesterone E2_Progesterone E2_Progesterone
## RNF2 38.516690 39.405321 37.935830 41.097939 40.706197
## E2F1 4.580853 5.716005 6.380168 5.259367 5.562977
## GREB1 25.249426 25.976837 24.504076 24.370023 23.257813
## BAP1 36.467995 53.828737 52.426210 54.923232 50.771586
## E2 E2 E2_R5020 E2_Progesterone E2 E2_R5020
## RNF2 36.019152 38.266073 39.506667 38.127430 34.949106 36.921805
## E2F1 5.437536 5.717051 6.265557 5.713678 5.475839 5.441846
## GREB1 24.742712 23.968580 23.353371 24.486605 23.970045 23.683170
## BAP1 42.890593 33.353263 51.972375 55.249781 36.410661 55.047490
## E2_Progesterone E2_R5020 E2_R5020 E2_Progesterone E2
## RNF2 41.442032 38.909271 39.110774 34.501752 33.957258
## E2F1 5.233502 5.333585 4.715712 5.890774 6.124918
## GREB1 23.369010 23.596555 24.698469 22.812247 23.213565
## BAP1 51.558513 52.911366 49.867967 54.869362 40.483741
## E2 E2_R5020 E2 E2_R5020 E2 E2_Progesterone
## RNF2 36.303870 37.772537 33.61251 40.195715 35.309970 42.492230
## E2F1 5.745819 5.742541 5.43056 4.692601 4.936273 4.783116
## GREB1 26.236504 25.145120 24.59093 24.936866 27.027741 26.392113
## BAP1 38.033621 60.464678 41.32940 56.584159 44.174546 49.464618
## E2_R5020 E2_Progesterone
## RNF2 49.457395 36.838772
## E2F1 4.878229 6.387816
## GREB1 24.192221 23.500812
## BAP1 47.611069 50.668117
plot_multiple_genes(genes = genes,
centered_cvg = cvg_list,
centered_tx = tx_list,
gene_exprs = gene_exprs)